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Abstract 

In this work we show that randomized (block) coordinate descent methods can be accelerated 
by parallelization when applied to the problem of minimizing the sum of a partially separable 
smooth convex function and a simple separable convex function. The theoretical speedup, as 
compared to the serial method, and referring to the number of iterations needed to approximately 
solve the problem with high probability, is a simple expression depending on the number of 
parallel processors and a natural and easily computable measure of separability of the smooth 
component of the objective function. In the worst case, when no degree of separability is present, 
there may be no speedup; in the best case, when the problem is separable, the speedup is equal 
to the number of processors. Our analysis also works in the mode when the number of blocks 
being updated at each iteration is random, which allows for modeling situations with busy or 
unreliable processors. We show that our algorithm is able to solve a LASSO problem involving 
a matrix with 20 billion nonzeros in 2 hours on a large memory node with 24 cores. 

Keywords: Parallel coordinate descent, big data optimization, partial separability, huge- 
scale optimization, iteration complexity, expected separable over-approximation, composite ob- 
jective, convex optimization, LASSO. 

1 Introduction 

Big data optimization. Recently there has been a surge in interest in the design of algorithms 
suitable for solving convex optimization problems with a huge number of variables |13| l8]. Indeed, 
the size of problems arising in fields such as machine learning, network analysis, PDEs, truss topology 
design and compressed sensing usually grows with our capacity to solve them, and is projected to 
grow dramatically in the next decade. In fact, much of computational science is currently facing 
the "big data" challenge, and this work is aimed at developing optimization algorithms suitable for 
the task. 

*The work of the first author was supported by EPSRC grants EP/J020567/1 (Algorithms for Data Simplicity) 
and EP/I017127/1 (Mathematics for Vast Digital Resources). The second author was supported by the Centre for 
Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding 
Council). An open source code with an efficient implementation of the algorithm(s) developed in this paper is 
published here: http ://code.google.eom/p/ac-dc/. | 
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Coordinate descent methods. Coordinate descent methods (CDM) are one of the most suc- 
cessful classes of algorithms in the big data optimization domain. Broadly speaking, CDMs are 
based on the strategy of updating a single coordinate (or a single block of coordinates) of the vector 
of variables at each iteration. This often drastically reduces memory requirements as well as the 
arithmetic complexity of a single iteration, making the methods easily implementable and scalable. 
In certain applications, a single iteration can amount to as few as 4 multiplications and additions 
only [llj ! On the other hand, many more iterations are necessary for convergence than it is usual 
for classical gradient methods. Indeed, the number of iterations a CDM requires to solve a smooth 
convex optimization problem is 0{ ^^^ ), where e is the error tolerance, n is the number variables 
(or blocks of variables), L is the average of the Lipschitz constants of the gradient of the objective 
function associated with the variables (blocks of variables) and R is the distance from the starting 
iterate to the set of optimal solutions. On balance, as observed by numerous authors, serial CDMs 
are much more efficient for big data optimization problems than most other competing approaches, 
such as gradient methods [71 [11] . 



Parallelization. We wish to point out that for truly huge-scale problems it is absolutely necessary 
to parallelize. This is in line with the rise and ever increasing availability of high performance 
computing systems built around multi-core processors, CPU-accelerators and computer clusters, 
the success of which is rooted in massive parallelization. This simple observation, combined with 
the remarkable scalability of serial CDMs, leads to our belief that the study of parallel coordinate 
descent methods (PCDMs) is a very timely topic. 

Research Idea. The work presented in this paper was motivated by the desire to answer the 
following question: 

Under what natural and easily verifiable structural assumptions on the objective function 
does parallelization of a coordinate descent method lead to acceleration? 

Our starting point was the following simple observation. Assume that we wish to minimize a 
separable function F of n variables (i.e., a function that can be written as a sum of n functions 
each of which depends on a single variable only). The problem of minimizing F can be trivially 
decomposed into n independent univariate problems. Now, if we have n processors/threads/cores, 
each assigned with the task of solving one of these problems, the number of parallel iterations should 
not depend on the dimension of the probleixl^ In other words, we get an n-times speedup compared 
to the situation with a single processor only. Note that any parallel algorithm of this type can 
be viewed as a parallel coordinate descent method. Hence, a PCDM with n processors should be 
n-times faster than a serial one. If r processors are used instead, where 1 < t < n, one would 
expect a r-times speedup. 

By extension, one would perhaps expect that optimization problems with objective functions 
which are "close to being separable" would also be amenable to acceleration by parallelization, where 
the acceleration factor r would be reduced with the reduction of the "degree of separability". One 
of the main messages of this paper is an affirmative answer to this. Moreover, we give explicit and 
simple formulae for the speedup factors. 

'^For simplicity, assume the distance from the starting point to the set of optimal solutions does not depend on 
the dimension. 
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As it turns out, and as we discuss later in this section, many real- world big data optimization 
problems are, quite naturally, "close to being separable". We believe that this means that PCDMs 
is a very promising class of algorithms when it comes to solving structured big data optimization 
problems. 

Minimizing a partially separable composite objective. In this paper we study the problem 

minimize f{x) + n{x) subject to X e R^, (1) 

V ' 

= F{x) 

where / is a (block) partially separable smooth convex function and is a simple (block) separable 
convex function. We allow J7 to have values in RU {00} , and for regularization purposes we assume 
f2 is proper and closed. While ([T| is seemingly an unconstrained problem, Q can be chosen to model 
simple convex constraints on individual blocks of variables. Alternatively, this function can be used 
to enforce a certain structure (e.g., sparsity) in the solution. For a more detailed account we refer 
the reader to [13]. Further, we assume that this problem has a minimum {F* > —00). What we 
mean by "smoothness" and "simplicity" will be made precise in the next section. 

Let us now describe the key concept of partial separability. Let x G R^ be decomposed into 
n non-overlapping blocks of variables x^^\ . . . (this will be made precise in Section |2|. We 

assume throughout the paper that / : R''^ — )• R is partially separable of degree uj, i.e., that it can 
be written in the form 

f{x) = Y.fj{x), (2) 

Jej 

def 

where ^7 is a finite collection of nonempty subsets of [n] = {1, 2, . . . , n} (possibly containing identical 
sets multiple times), fj are differentiable convex functions such that fj depends on blocks x^*^ for 
i € J only, and 

|J|<w for all JeJ. (3) 

Clearly, 1 < w < n. Most of the many variations of the basic PCDM we analyze in this paper do 
not require the decomposition ^ to be known; all we need to know is co, which is very often an 
easily computable quantity. 

Examples of partially separable functions. Many objective functions naturally encountered in 
the big data setting are partially separable. Here we give examples of three loss/objective functions 
frequently used in the machine learning literature and also elsewhere. For simplicity, we assume all 
blocks are of size 1 (i.e., N = n). Let 

m 

/(x) = ^£(x,A,-,y,), (4) 
i=i 

where m is the number of examples, x E R" is the vector of features, (Aj, yj) £ R" x R are labeled 
examples and C is one of the three loss functions listed in Table [l] Let A £ with row j equal 

to AJ. Often, each example depends on a few features only; the maximum over all features is the 
degree of partial separability 00. More formally, note that the j-th function in the sum ^ in all 
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Square Loss 




Logistic Loss 


log(l + e-y^^'i'') 


Hinge Square Loss 


\ max{0, 1 - yjAjx}'^ 



Table 1 : Three examples of loss of functions 



cases depends on ||Aj||o coordinates of x (the number of nonzeros in the j-th row of A) and hence 
/ is partially separable of degree 

Lo = max IIq. 
j 

All three functions of Table [T] are smooth (as per the definition in the next section) . We refer the 
reader to |9] for more examples of interesting (but nonsmooth) partially separable functions arising 
in graph cuts and matrix completion. 

Brief literature review. Several papers were written recently studying the iteration complexity 
of serial CDMs of various flavours and in various settings. We will only provide a brief summary 
here, for a more detailed account we refer the reader to |13) . 

Classical CDMs update the coordinates in a cyclic order; the first attempt at analyzing the 
complexity of such a method is due to [15j. Stochastic/randomized CDMs, that is, methods where 
the coordinate to be updated is chosen randomly, were first analyzed for quadratic objectives |17| 
[2], later independently generalized to Li-regularized problems |16j and smooth block-structured 
problems [7], and finally unified and refined in |12| 113) . The problems considered in the above 
papers are either unconstrained or have (block) separable constraints. Recently, randomized CDMs 
were developed for problems with linearly coupled constraints [H [5] . 

A greedy CDM for Li-regularized problems was first analyzed in |T1] and a CDM with inexact 
updates was first proposed and analyzed in [10]. Partially separable problems were independently 
studied in [9j, where an asynchronous parallel stochastic gradient algorithm was developed to solve 
them. Parallel CDMs were proposed and analyzed in [Tl|3]. 

Contents. We start in Section [2] by describing the block structure of the problem, establishing 
notation and detailing assumptions. Subsequently we propose and comment in detail on two parallel 
coordinate descent methods. Section [2] ends with a list of some of the contributions/findings of this 
paper. In Section [3] we deal with issues related to the selection of the blocks to be updated in each 
iteration. It will involve the development of some elementary random set theory. Sections [4]|5] deal 
with issues related to the computation of the update to the selected blocks and develop a theory 
of Expected Separable Overapproximation (ESO), which is a novel tool we propose for the analysis 
of our algorithms. In Section |6] we analyze the iteration complexity of our methods and finally. 
Section [7] reports on some very promising preliminary computational experiments with a big data 
(cca 350GB) LASSO problem with a billion variables. We are able to solve the problem using one 
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of our methods on a large memory machine with 24 cores in 2 hours, pushing the difference between 
the objective value at the starting iterate and the optimal point from 10^^ down to 10~^^. 

2 Parallel Block Coordinate Descent Methods 



In Section 2.1 we formalize the block structure of the problem, establish notation that will be used 



in the rest of the paper and list assumptions. In Section[2.2|we propose two parallel block coordinate 



descent methods and comment in some detail on the steps. Finally, in Section 2.3 we enumerate 
some of the contributions/findings of this paper. 

2.1 Block structure, notation and assumptions 

The block structure of ([T]) is given by a decomposition of into n subspaces as follows. Let U S 
j^NxN ^ column permutation of the N x N identity matrix and further let U = [Ui, U2, • • • , Un] 
be a decomposition of U into n submatrices, with Ui being of size x Ni, where Y2i — 

Proposition 1 (Block decomposition). Any vector x G can he written uniquely as 

n 

x = ^C/,x«, (5) 

i=l 

where x*^*) € Rj = R^*. Moreover, x*^*) = U^x. 

Proof. Noting that UU'^ = UiUf is the N x N identity matrix, we have x = UiUfx. Let us 
now show uniqueness. Assume that x = Uix^^^ = f/jXg^ where x[*\x2'' G Ri- Since 

rp I iV,- X iV- identity matrix, if z = 7, 

I Nj X Ni zero matrix, otherwise, 

for every j we get = Uj'ix — x) = J2i Ui{xf' — X2 ■*) = x^"''' — X2"''*. □ 

In view of the above proposition, from now on we write x^*) '= Ufx G Rj, and refer to x^*^ as 
the i-th block of x. The definition of partial separability in the introduction is with respect to these 
blocks. For simplicity, we will sometimes write x = (x'--'^^ . . . , x*-")). 

Projection onto a set of blocks. For S C [n] and x G R^ we write 

^^fj^C/.xW. (7) 



X[S\ 



That is, given x G R'^, xjg] is the vector in R^ whose blocks i G 5 are identical to those of x, 
but whose other blocks are zeroed out. In view of Proposition [T| we can equivalently define X[5] 
block-by-block as follows 

= /^^^^' ^ e ^' (8) 

1 (G Ri), otherwise. 



5 



Inner products. The standard Euclidean inner product in spaces and Rj, i G [n], will be 
denoted by (•, •). Letting x,y ^ , the relationship between these inner products is given by 



j=l i=l j=l 1=1 

,N 



1=1 



For any w G R" and x, y G R we further define 



{x,y)w = ^Wi{x^'\y 



i=l 



(9) 



For vectors z = (zi, . . . , Zn)^ and w = {wi, . . . , Wn)^ we write u; z =^ ( 



WiZi, . . .,WnZ„ 



Norms. Spaces Rj, i G [n], are equipped with a pair of conjugate norms: and \\t\\*^,^ 

max||s|| <i(s,t), t G Rj. For w G R++, define a pair of conjugate norms in R^ by 



def 



X 



1/2 



\v\\w = niax 
If!|u.<i 



.i=l 



1/2 



(10) 



def 



Often we will use w = L = (Li, L2, . . . , L^) , where the constants Li are defined below. 



Smoothness of /. We assume throughout the paper that the gradient of / is block Lipschitz, 
uniformly in x, with positive constants Li, . . . , L„, i.e., that for all x G R^, i G [n] and t G Ri, 



|Vi/(x + C/.t)-Vi/(x)||T,) <L,||t||(i) 



(11) 



where Vif{x) = (V/(x))(*) = UfVfix) G Ri. An important consequence of (llip is the following 
standard inequality 

fix + U^t) < f{x) + {VJ{x),t) + ^\\t\\ly (12) 



Separability of Q. We assume that Q : R^ — t- R U {+00} is (block) separable, i.e., that it can 
be decomposed as follows: 



n(x) 



2: a 

1=1 



(13) 



where the functions : Rj — )• RU {+cxd} are convex and closed. 



Strong convexity. In one of our two complexity results (Theorem 20 ) we will assume that either 
/ or (or both) is strongly convex. A function (/> : R^ — )■ R U {+00} is strongly convex with 
respect to the norm || • ||^ with convexity parameter lJ,(f,{w) > if for all x,y £ dom(p, 



Hy) > H^) + {'l>'ix),y - x) + t^\\y - x\ 



2 

W ) 



(14) 
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where (j)'{x) is any subgradient of <j) at x. The case with ^(f,{w) = reduces to convexity. Strong 
convexity of F may come from / or O (or both); we write ^f{w) (resp. fi^{w)) for the (strong) 
convexity parameter of / (resp. Q). It follows from (|14[) that 



fJ-F{w) > fif{w) + fj,n{w). (15) 
The following characterization of strong convexity will be useful. For all x,y G dome/) and 

Ae [0,1], 

<P{Xx + (1 - X)y) < + (1 - A)</.(y) - l^iM^\\^ _ yf^, (ig) 

It can be shown using (12) and (14) that fJ-f{'w) < 

2.2 Algorithms 

In this paper we develop and study two generic parallel coordinate descent methods. The main 
method is PCDMl; PCDM2 is its "regularized" version which explicitly enforces monotonicity. As 
we will see, both of these methods come in many variations, depending on how Step 3 is performed. 



Algorithm 1 Parallel Coordinate Descent Method 1 (PCDMl) 

1: Choose initial point xq G 
2: for k = 0,l,2,... do 

3: Randomly generate a set of blocks C {1, 2, . . . , n} 
4: Xk+i ^ Xk + {h{xk))[s^:] 
5: end for 



Algorithm 2 Parallel Coordinate Descent Method 2 (PCDM2) 

1: Choose initial point xq G 
2: for A; = 0,l,2, ... do 

3: Randomly generate a set of blocks C {1, 2, . . . , n} 

4: Xk+l ^ Xk + {h{xk))[St,] 

5: If F{xk+i) > F{xk), then 

Xk+l ^ Xk 

6: end for 



Let us comment on the individual steps of the two methods. 

Step 3. At the beginning of iteration k we pick a random set {Sk) of blocks to be updated (in 
parallel) during that iteration; is a realization of a random set-valued mapping S with values 
in 2["1. For brevity, in this paper we refer to such a mapping by the name sampling. We limit our 
attention to uniform samplings, i.e., random sets having the following property: P(z 6 5) = const 
for all blocks i. That is, the probability that a block gets selected is the same for all blocks. Although 
we give an iteration complexity result covering all such samplings (provided that each block has a 
chance to be updated, i.e., const > 0), there are interesting subclasses of uniform samplings (such 
as doubly uniform and nonoverlapping uniform samplings; see Section [3]) for which we give better 
results. 

Step 4. For x G R^ we define 

h{x) =^ arg min Ha w{x, h), (17) 
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where 

Hp^^{x, h) fix) + {Vf{x),h) + l\\h\\l + n{x + h), (18) 

and P > 0, w = (wi, . . . , Wn)'^ G ^'++ parameters of the method that we will comment on later. 
Note that in view of ([s]), (10) and (13), Hp^^(x, •) is block separable; 



n 

Hp,u,{x, h) = fix) + J2{{V^f{x), + + + 



1=1 



Consequently, we have h{x) = {h^^\x), ■ ■ ■ , h^'^\x)) S R^, where 

= argmin{(Vi/(x),t) + ^p||2 , J7i(x« + t)}. 

We mentioned in the introduction that besides (block) separability, we require i7 to be "simple". 
By this we mean that the above optimization problem leading to h^'''\x) is "simple" (e.g., it has a 
closed-form solution). Recall from ([s]) that (/i(a;fc))[5j.] is the vector in identical to h{xj.) except 
for blocks i ^ Sk, which are zeroed out. Hence, Step 4 of both methods can be written as follows: 

In parallel for i ^ Sk do: x^i+i ^ ^k^ + h^^\xk)- 
Parameters /3 and w depend on / and S and stay constant throughout the algorithm. We 



are not ready yet to explain why the update is computed via (17) and (18) because we need 
technical tools, which will be developed in Section |3] to do so. Here it suffices to say that the 
parameters (5 and w come from a separable quadratic overapproximation of E[/(x + ^[gj)], viewed 
as a function of /i G R^. Since expectation is involved, we refer to this by the name Expected 
Separable Overapproximation (ESO). This novel concept, developed in this paper, is one of the 
main tools of our complexity analysis. Section |4] motivates and formalizes the concept, answers the 
why question, and develops some basic ESO theory. 

Section [5] is devoted to the computation of /3 and w for partially separable / and various special 
classes of uniform samplings S. Typically we will have Wi = Li, while /3 will depend on easily 
computable properties of / and S. For example, if S is chosen as a subset of [n] of cardinality r, 
with each subset chosen with the same probability (we say that S is r-nice) then, assuming n > 1, 
we may choose w = L and /3 = 1 + ^'^~^}_'[~^^ , where w is the degree of partial separability of /. 
More generally, if S is any uniform sampling with the property 15"! = r with probability 1, then we 
may choose w = L and (3 = min{cL!,T}. Note that in both cases w = L and that the latter f3 is 
always larger than (or equal to) the former one. This means, as we will see in Section [Hj that we 
can give better complexity results for the former, more specialized, sampling. We analyze several 
more options for S than the two just described, and compute parameters /3 and w that should be 
used with them (for a summary, see Table [2]) . 

Step 5. The reason why, besides PCDMl, we also consider PCDM2, is the following: in some 
situations we are not able to analyze the iteration complexity of PCDMl (non-strongly-convex F 
where monotonicity of the method is not guaranteed by other means than by directly enforcing it 
by inclusion of Step 5). Let us remark that this issue arises for general Q only. It does not exist for 
= 0, Q{-) = A|| • 111 and for 0, encoding simple constraints on individual blocks; in these cases one 
does not need to consider PCDM2. Even in the case of general Q we sometimes get monotonicity for 
free, in which case there is no need to enforce it. Let us stress, however, that we do not recommend 
implementing PCDM2 as this would introduce too much overhead; in our experience PCDMl works 
well even in cases when we can only analyze PCDM2. 
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2.3 Contributions 

Here is a list of what we believe are the most notable contributions of this paper (not in order of 
significance) : 

1. Problem generality. We give the first complexity analysis for a parallel coordinate descent 
method for problem ([T]) in its full generality. 

2. Relationship to existing results. To the best of our knowledge, there are just two papers 
analyzing a parallel coordinate descent algorithm [Tl[3]. In the first paper all blocks are of size 
1, S corresponds to what we call in this paper a T-nice sampling (i.e., all sets of r coordinates 
are updated at each iteration with equal probability) and hence their algorithm is somewhat 
comparable to one of the many variants of our general method. While the analysis in [Ij 
works for a restricted range of values of r, our results hold for all r G [n]. Moreover, the 
authors consider a more restricted class of functions / and the special case = A||3;||i, which 
is simpler to analyze. Lastly, the theoretical speedups obtained in [1], when compared to the 
serial CDM method, depend on a quantity that is hard to compute in big data settings (it 
involves the computation of an eigenvalue of a huge-scale matrix). Our speedups are expressed 
in terms of natural and easily computable quantity: the degree u of partial separability of /. 
The parallel CDM method of the second paper only allows all blocks to be updated at each 
iteration. Unfortunately, the analysis in [3j is too coarse as it does not offer any theoretical 
speedup when compared to its serial counterpart. 

3. Partial separability. We give the first analysis of a coordinate descent type method dealing 
with a partially separable objective. In order to run the method, all we need to know about 
/ are the Lipschitz constants Li and the degree of partial separability u. It is crucial that 
these quantities are easily computable/predictable in the huge-scale setting. Indeed, if f{x) = 
^\\Ax — 6|p and we choose all blocks to be of size 1, then Li is equal to the squared Euclidean 
norm of the i-th column of A and lo is equal to the maximum over the cardinalities of (i.e., 
the number of nonzeros in) the rows of A. 

4. Algorithm unification. Depending on the choice of the block structure (as implied by the 
choice of n and the matrices Ui, . . . , Un) and the way blocks are selected at every iteration 
(as given by the choice of 5"), our methods encode a range of known and new algorithms. For 
example, in special cases, our method reduces to the serial coordinate descent method {Ni = 1 
for all i and IS"! = 1 with probability 1), serial block coordinate descent method (IS"! = 1 
with probability 1), PCDM with variable number of updates per iteration, distributed block 
coordinate descent method (this will arise for a special choice of 5", but will not discuss it 
here as a follow up paper is currently being prepared), gradient descent method (for n = 1 or 
\S\ = n) and more. In particular, we give the first analysis of a method which "continuously" 
interpolates between a serial coordinate descent method and the gradient method. 

5. Expected Separable Overapproximation (ESO). En route to proving the iteration com- 
plexity results for our algorithms, we develop a theory of deterministic and expected separable 
overapproximation (Sections |4] and |5]). We believe this is of independent interest; for instance, 
methods based on ESO can be compared favorably to the Diagonal Quadratic Approximation 
(DQA) approach used in the decomposition of stochastic optimization programs |14j . 
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6. Variable number of updates per iteration. We give the first analysis of a PCDM wliich 
allows for a variable number of blocks to be updated throughout the iterations. This may be 
useful in some settings such as when the problem is being solved in parallel by r unreliable 
processors each of which computes its update h^^\xk) with probability pb and is busy/down 
with probability 1 — Pb (read about binomial sampling in Section [3|. 

7. Parallelization speedup. We show theoretically (Section|6]) and numerically (SectionjTj) that 
PCDM accelerates on its serial counterpart for partially separable problems. The theoretical 
parallelization speedup factors for several variants of our method (in case of a non-strongly 
convex objective) are given in Table |3j For instance, in the case when all block are updated at 
each iteration (we later refer to S having this property by the name fully parallel sampling), 
the speedup factor is equal to ^. If the problem is separable {u = 1), the speedup is equal to 
n; if the problem is not separable (w = n), there may be no speedup. For strongly convex F 



the situation is even better; the details are given in Section 6.2 



Computations. We demonstrate that our method is able to solve a LASSO problem involving 
a matrix with a billion columns and 2 billion rows on a large memory node with 24 cores in 
2 hours (Section [7|. 

Code. The open source code with an efficient implementation of the algorithm(s) developed 
in this paper is published here: ,ht tp : / /code . google . com/p/ ac-dc / . , 



3 Block Samplings 

In Step 3 of both PCDMl and PCDM2 we choose a random set of blocks Sk to be updated at 
the current iteration. Formally, Sk is a realization of a random set-valued mapping S with values 
in 2["]. For brevity, in this paper we refer to 5* by the name sampling. A sampling S is uniquely 
characterized by the probability density function 

P{S) = P{S = S), S C [n]; (19) 
that is, by assigning probabilities to all subsets of [n]. Further, we let p = [pi, . . . ,Pn)'^ , where 

p/=P(iG5). (20) 



In Section 3.1 we describe those samplings for which we analyze our methods. In Section 3.2 



we 



prove several technical results, which will be useful in the rest of the paper. 

3.1 Uniform, Doubly Uniform and Nonoverlapping Uniform Samplings 

A sampling is proper if > for all blocks i. That is, from the perspective of PCDM, under 
a proper sampling each block gets updated with a positive probability at each iteration. Clearly, 
PCDM can not converge for a sampling that is not proper. 

A sampling S is uniform if all blocks get updated with the same probability, i.e., if pi = const. 



'E\\S\] 

We show in (32) that, necessarily, pi = ^' . Further, we say S is nil if P(0) = 1. Note that a 



uniform sampling is proper if and only if it is not nil. 
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All our iteration complexity results in this paper are for PCDM used with a proper uniform 



sampling (see Theorems 19 and 20) for which we can compute (3 and w giving rise to (18). We 
give a general complexity result covering all proper uniform samplings (Theorem [l2| as well as 
refined results for two special subclasses thereof: doubly uniform samplings (Theorem [15|) and 



nonoverlapping uniform samplings (Theorem 13). We will now give the definitions: 



1. Doubly Uniform (DU) samplings. A DU sampling is one which generates all sets of equal 
cardinality with equal probability. That is, P(5") = P(S"') whenever \S'\ = \S"\. The name 
comes from the fact that this definition postulates a different uniformity property, "standard" 
uniformity is a consequence. Indeed, let us show that a DU sampling is necessarily uniform. 
Let Qj = PdiS*! = j) for j = 0,1,..., n and note that from the definition we know that 
whenever S is of cardinality j, we have P{S) = qj/{^)- Finally, using this we obtain 

TTi Th TL Th 

5:jeS j = 15:je5 j = lS:i<^S^ j = l ^ j = l 

\S\=j \S\=j 

It is clear that each DU sampling is uniquely characterized by the vector of probabilities q; 
its density function is given by 

P(5) = ^, SC[n]. (21) 
[\s\) 

2. Nonoverlapping Uniform (NU) samplings. A NU sampling is one which is uniform and 
which assigns positive probabilities only to sets forming a partition of [n]. Let 5^, 5"^, . . . , S"' be 
a partition of [n], with > for all j. The density function of a NU sampling corresponding 
to this partition is given by 

P(5) = /7' ■iiSG{S\S^...,S'}, ^22) 
I 0, otherwise. 

Note that £[151] = f . 
Let us now describe several interesting special cases of DU and NU samplings: 

3. Nice sampling. Fix 1 < r < n. A r-nice sampling is a DU sampling with Qt = 1. 

Interpretation: There are r processors/threads/cores available. At the beginning of each 
iteration we choose a set of blocks using a r-nice sampling (i.e., each subset of r blocks is 
chosen with the same probability), and assign each block to a dedicated processor/thread/core. 
Processor assigned with block i would compute and apply the update h^^\xk)- This is the 
sampling we use in our computational experiments. 

4. Independent sampling. Fix 1 < r < n. A r-independent sampling is a DU sampling with 



Qk 




k = l,2,...,T, 

k = T + 1, . . . ,n, 
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where c, = (i)^ and c, = {^Y - Y.tl ( for k > 2. 

Interpretation: There are r processors/threads/cores available. Each processor chooses one 
of the n blocks, uniformly at random and independently of the other processors. It turns out 
that the set S of blocks selected this way is DU with q as given above. Since in one parallel 
iteration of our methods each block in S is updated exactly once, this means that if two or 
more processors pick the same block, all but one will be idle. On the other hand, this sampling 
can be generated extremely easily and in parallel! For r <C n this sampling is a good (and 
fast) approximation of the r-nice sampling. For instance, for n = 10^ and r = 8 we have 
qs = 0.9723, qr = 0.0274, qe = 0.0003 and for A: = 1, . . . , 5. 

5. Binomial sampling. Fix 1 < r < n and < pf, < 1. A (r,pf,)-binomial sampling is defined 
as a DU sampling with 

9fc= Qpfe(l-Pfe)', k = 0,l,...,T. (23) 
Notice that £[151] = rpb and E[\S\'^] = Tpb{l + rpb-pb)- 

Interpretation: Consider the following situation with independent equally unreliable processors. 
We have r processors, each of which is at any given moment available with probability pb and 
busy with probability 1 — Pb, independently of the availability of the other processors. Hence, 
the number of available processors (and hence blocks that can be updated in parallel) at each 
iteration is a binomial random variable with parameters r and pb- That is, the number of 
available processors is equal to k with probability qk- 

— Case 1 (explicit selection of blocks): We learn that k processors are available at the 
beginning of each iteration. Subsequently, we choose k blocks using a k-m.ce sampling 
and "assign one block" to each of the k available processors. 

— Case 2 (implicit selection of blocks): We choose r blocks using a r-nice sampling and 
assign one to each of the r processors (we do not know which will be available at the 
beginning of the iteration). With probability q^^ k of these will send their updates. It is 
easy to check that the resulting effective sampling of blocks is (r, pfc)-binomial. 

6. Serial sampling. This is a DU sampling with qi = 1. Also, this is a NU sampling with I = n 
and = {j} for j = 1, 2, . . . , /. That is, at each iteration we update a single block, uniformly 
at random. This was studied in |13j . 

7. Fully parallel sampling. This is a DU sampling with q^ = 1. Also, this is a NU sampling 
with I = 1 and = [n]. That is, at each iteration we update all blocks. 

The following simple result says that the intersection between the class of DU and NU samplings 
is very thin. A sampling is called vacuous if P(0) > 0. 

Proposition 2. There are precisely two nonvacuous samplings which are both DU and NU: i) the 

serial sampling and ii) the fully parallel sampling. 

Proof. Assume S is nonvacuous, NU and DU. Since S is nonvacuous, F{S = 0) = 0. Let S C [n] 
be any set for which P{S = S) > 0. If 1 < \S\ < n, then there exists S" 7^ 5 of the same 
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cardinality as S having a nonempty intersection with S. Since S is doubly uniform, we must have 
P(5' = S') = P{S = S') > 0. However, this contradicts the fact that S is non-overlapping. Hence, 
S can only generate sets of cardinalities 1 or n with positive probability, but not both. One option 
leads to the fully parallel sampling, the other one leads to the serial sampling. □ 

3.2 Technical results 

For a given sampling S and [n] we let 



p./= P(iG5,iG5)= ^ P(S). 

s.{i,j}cs 

The following simple result has several consequences which will be used throughout the paper. 



(24) 



Lemma 3 (Sum over a random index set). Let 7^ J C [n] and S be any sampling. If 6i, i £ [n], 
and 9ij, for € [n] x [n] are real constants, thei^ 



E 



ieJnS 



E 



9i\\JnS\=k 



Y^P{ieS\\JnS\ = k)9i, 



(25) 



jeJns jeJns J J 
Proof We prove the first statement, proof of the remaining statements is essentially identical: 



(26) 



E 



ieJns 



^ E f E p(^) = E E ^^p(^) = E^^ E p(^) = E^^ 



□ 



E 
E 



E 



iJn^l 



The consequences are summarized in the next theorem and the discussion that follows. 

Theorem 4. Let 7^ J C [n] and S be an arbitrary sampling. Further, let a,h £ R^, w G R" and 
let g be a block separable function, i.e., g{x) = gi^x^^^). Then 

{a, h)pQw, 

|2 ' 



E 



h, 



E 



n 

E[ra^(x«+/i^')) + (l-p.)9^(x«) 



(27) 

(28) 

(29) 
(30) 

(31) 



i=l 



Sum over an empty index set will, for convenience, be defined to be zero. 
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def 

Moreover, the matrix P = {pij) is positive semidefinite. 

Proof. Noting that | J n 5| = EieJn5 1, 1^ n = (^^^.^^ 1)2 = ^^^^^^^ ^^^^^^ 1, (a, h^^^) 



i£S 10 ies «=i ies 

all five identities follow directly by applying Lemma [s] Finally, for any ^ = (^i, . o \T 



9nV G R", 



The above results hold for arbitrary samplings. Let us specialize them, in order of decreasing 
generality, to uniform, doubly uniform and nice samplings. 



Uniform samplings. If S is uniform, then from (27) using J = [n] we get 

Ef|5|l 



Pi 



Plugging (pl) into (p7|), (p9|), (|30|) and ^ yields 

|Jn5|l =^E[|5|], 



E 



E 



E 



E 



9{x + h^S]) 



>'[S]\\w 

n\s\] 



E[|g|J||,||2 



g{x + h) + (l 



n\s\] 



g{x)- 



Doubly uniform samplings. For doubly uniform S, pij is constant for i ^ j: 



Pij 



• E[|5p-|5|] 
n{n—l) 



n> 1, 



^0 n = l. 

Indeed (consider n > 1, case n = 1 is trivial), this follows from 



p,, = Y^p{{^,j} cs\\s\= k)P{\s\ = k) = Y^ gSyPd^l = 



k=l 



k=l 



Substituting (37) and (29) into (28) then gives 

|2i _ n T|2 I Tl^ E[|5p-|5|] , I 



E 



|Jng|1 = (|J|^-|J|) ^XliT-i} + l-^|- 



(32) 

(33) 

(34) 
(35) 
(36) 



(37) 



(38) 
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Nice samplings. Finally, if S is r-nice (and r 7^ 0), then £[151] = r and E[|S'p] = r^, which 

E[|Jn^p] = Mifi + 



used in ( 38 ) gives 



(\J\-l)ir-l) 



max{l,ri— 1} I 

(39) 

Moreover, assume that P(|Jn S\ = k) ^ (this happens precisely when < k < \ J\ and 
k < T < n — \ J\ + k). Then for all i ^ J, 



PiieS\\JnS\ = k) 



Substituting this into (25) yields 



E 



ieJn5 



\Jr\S\ = k 



_k_ 
\J\ 



(40) 



4 Expected Separable Overapproximation 

Recall that given x^, in PCDMl the next iterate is the random vector x^+i 
particular choice of /i G R^. Further recall that in PCDM2, 



Xk + for a 



Xk+l 



Xk + h^gy if F(xfc + /lj_5]) < F(XA:), 

iXfc, otherwise, 

again for a particular choice of h. While in Section |2] we mentioned how h is computed, i.e., that h 
is the minimizer of Hj^^y^^x, •) (see (17) and (18)), we did not explain why is h computed this way. 
The reason for this is that the tools needed for this were not yet developed at that point (as we will 
see, some results from Section |3] are needed). In this section we give an answer to this why question. 

Given x^ G R^, after one step of PCDMl performed with update h we get E[F(xfc+i) | x^] = 
E[F(j;fc + /i[^]) I Xk]- On the the other hand, after one step of PCDM2 we have 

E[F(xfe+i) I Xk] = E[mm{F{xk + h^g^), F{xk)} \ Xk] < mm{E[F{xk + h^§^) \ Xk],F{xk)}. 
So, for both PCDMl and PCDM2 the following estimate holds, 

E[F(xfc+i) I Xk] < E[F{xk + /i[5]) I Xk]. (41) 
A good choice for h to be used in the algorithms would be one minimizing the right hand side of 



inequality (41). At the same time, we would like the minimization process to be decomposable 
so that the updates h^^\ i G 5, could be computed in parallel. However, the problem of finding 
such h is intractable in general even if we do not require parallelizability. Instead, we propose to 



construct /compute a "simple" separable overapproximation of the right-hand side of (41). Since 
the overapproximation will be separable, parallelizability is guaranteed; "simplicity" means that the 
updates /i^*^ can be computed easily (e.g., in closed form). 

From now on we replace, for simplicity and w.l.o.g., the random vector Xk by a fixed deterministic 
vector X G R^. We can thus remove conditioning in (41) and instead study the quantity E[F(x + 



/ij^j)]. Further, fix /i G R^. Note that if we can find /? > and w G R"_|- such that 



E 



fix + h 



[S] 



< 



f{x) + '^({Vf{x),h) + l\\h\ 



(42) 



15 



we indeed find a simple separable overapproximation of 'E[F{x + h 







E[/(x + /lrA,)+J](x+/lr 



l|42ll,||36l 



EM 



(V/(x),/i) + f 



+ 1 



EM 



+ 



EM 



1 



EM 



Q{x + h) 
(43) 



where we recall from ^ that Hp^^ix, h) = fix) + {Vfix),h) + + + h). 

That is, (43) says that the expected objective value after one parallel step of our methods, if 



block i G 5" is updated by h^'^\ is bounded above by a convex combination of F{x) and Hp.uj{x^ h). 
The natural choice of h is to set 



h{x) = arg min Ha ^{x, h). 



(44) 



Note that this is precisely the choice we make in our methods. Since ff^^^(x,0) = F{x), both 
PCDMl and PCDM2 are monotonic in expectation. 

The above discussion leads to the following definition. 

Definition 5 (Expected Separable Overapproximation (ESO)). Let l3 > 0, w & '^++ S be 

a proper uniform sampling. We say that / : — )• R admits a (/3, ti;)-ESO with respect to S if 



inequality (42) holds for all x, /i G R . For simplicity, we write {f,S) ~ ESO{(3,w). 



A few remarks: 

1. Infiation. If (/, S) ~ ESO{(3, w), then for p' > p and w' > w, (/, S) ~ ESO{p', w'). 

2. Reshuffling. Since for any c > we have 



— c||/i||^, one can "shuffle" constants between 



P and w as follows: 

{f,S) ^ ESO{c(3,w) ^ {f,S) ^ ESO{P,cw), c> 0. 

3. Strong convexity. If (/,5) ~ ES0{(3,w), then 

/3 > Hf{w). 



(45) 



(46) 



Indeed, it suffices to take expectation in ( 14 ) with y replaced by x + /ij^j and compare the 



resulting inequality with (42) (this gives > /ij(u;) which must hold for all h). 



Recall that Step 5 of PCDM2 was introduced so as to explicitly enforce monotonicity into the 
method as in some situations, as we will see in Section [6] we can only analyze a monotonic algo- 
rithm. However, sometimes even PCDMl behaves monotonically (without enforcing this behavior 
externally as in PCDM2). The following definition captures this. 



Definition 6 (Monotonic ESO). Assume (/, 5) ~ ES0{f3,w) and let h{x) be as in (|44]). We say 
that the ESO is monotonic if F{x + {h{x))^e,^) < F{x) for all x G domF. 
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4.1 Deterministic Separable Overapproximation (DSO) of Partially Separable 
Functions 



The following theorem will be useful in deriving ESO for uniform samplings (Section 5.1) and 



nonover lapping uniform samplings (Section 5.2). It will also be useful in establishing monotonicity 



of some ESOs (Theorems 12 and 13). 

Theorem 7 (DSO). Assume f is partially separable (i.e., it can be written in the form Letting 
Supp(/i) = {i G [n] : /i^*) 7^ 0}, for all x,h G we have 



/(X + H) < /(X) + (V/W, h) + max,,^|JnSupp(;,)l ||,||.^ 



(47) 



def 

Proof. Let us fix x and define (j){h) = f{x + h) — f{x) — (V/(x),/i). Fixing h, we need to show 

that (t){h) < l\\h\\l for 9 = msi^j(.j 6-^ , where 9-^ = | J n Supp(/i)|. One can define functions (/>'^ in 
an analogous fashion from the constituent functions fj, which satisfy 



Jej 

(l)J{0) = 0, Jej. 



Note that (12) can be written as 



i = 1,2, . . . ,n. 



(48) 
(49) 

(50) 



Now, since cp"^ depends on the intersection of J and the support of its argument only, we have 



JeJ JeJ \i=i J JeJ 



ie JnSupp(/i) 



(51) 



The argument in the last expression can be written as a convex combination of 1 + vectors: 
the zero vector (with weight ) and the 6'^ vectors {6Uih^^^ : i G JnSupp(/i)} (with weights i): 



(52) 



ieJnSupp(h) 



is JnSupp(h) 



Finally, we plug (52) into (51) and use convexity and some simple algebra: 



m < E 



^0^(0) + 1 Y <p'mh^'^] 

jeJnSupp(/i) 



JeJ'ieJnSupp(h) 



JeJ i=l i=l J<^J 



^E^II^^^^^IIw = ilNli- 



□ 



i=l 



Besides the usefulness of the above result in deriving ESO inequalities, it is interesting on its 
own for the following reasons. 
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1. Block Lipschitz continuity of V/. The DSO inequality ( |47[ ) is a generalization of (12) 
since (12) can be recovered from (47) by choosing h with Supp(/i) = {i} for i € [n]. 



2. Global Lipschitz continuity of V/. The DSO inequality also says that the gradient of / 
is Lipschitz with Lipschitz constant uj with respect to the norm || • 

fix + h)< fix) + {Vfix),h) + ^\\h\\l. (53) 

Indeed, this follows from ( |47[ ) via maxjgj- |Jn Supp(/i)| < maxjgj- |J| = uj. For uj = n this 
has been shown in |7]; our result for partially separable functions appears to be new. 

3. Tightness of the global Lipschitz constant. The Lipschitz constant w is "tight" in the 



following sense: there are functions for which uj cannot be replaced in (53) by any smaller 
number. We will show this on a simple example. Let fix) = ^||j4x|P with A G R,™^"- (blocks 
are of size 1). Note that we can write /(x + h) = fix) + {V fix), h) + Ah, and that 

L = (Li, . . . , L„) = diag(A^^). Let D = Diag(L). We need to argue that there exists A for 



which a '= max/i^o ^ Ufaip"^^ ~ Since we know that a < uj (otherwise (53) would not hold), 
all we need to show is that there is A and h for which 



h^A^Ah 



ujh^Dh. 



(54) 



Since fix) = Yl^ii^J^)'^^ where Aj is the j-th row of A, we assume that each row of A has 
at most UJ nonzeros (i.e., / is partially separable of degree u). Let us pick A with the following 
further properties: a) A is a 0-1 matrix, b) all rows of A have exactly uj ones, c) all columns 
of A have exactly the same number (fc) of ones. Immediate consequences: Li = k for all i, 
D = kin and ujm = kn. If we let em be the mxl vector of all ones and be the n x 1 vector 
of all ones, and set h = /c~^/^e„, then 

h^A^Ah = lelA^Aen = HujCm)'^ iojem) = ^ = ujn = w^e^A;/„e„ = uh^Dh, 



establishing (54). Using similar techniques one can easily prove the following more general 
result: Tightness also occurs for matrices A which in each row contain uj identical nonzero 
elements (but which can vary from row to row). 



4.2 ESO for a convex combination of samplings 

Let S*!, 5*2, ... , Sm be a collection of samplings and let q G R™ be a probability vector. By qjSj 
we denote the sampling S given by 

m 

p(^S = s)=Y,QjPiSj = S). (55) 
i=i 

This procedure allows us to build new samplings from existing ones. A natural interpretation of S 
is that it arises from a two stage process as follows. Generating a set via S is equivalent to first 
choosing j with probability qj, and then generating a set via Sj. 

Lemma 8. Let Si, S2, ■ ■ ■ , Sm be arbitrary samplings, q € R™ a probability vector and n : 2^"'^ — t- R 
any function mapping subsets of [n] to reals. If we let S = qjSj, then 
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, n. 



(u) n\s\] = Y.T=iqMSM 

(in) P(i G 5) = YJj=i Ij'Pi.i ^ Sj), for any i = 1,2, 
(iv) If Si, . . . ,Sm are uniform (resp. doubly uniform), so is S. 
Proof Statement (i) follows by writing £[^(5)] as 

mm m 

Y,P{S = SMS) ^ j;5^Q,P(5, = S)KiS) = Y.^^Y.^(S, = S)k{S) = ^g,E[K(S,)]. 

S S 3=1 j=l S j=l 

Statement (ii) follows from (i) by choosing n{S) = 15"!, and (iii) follows from (i) by choosing k as 
follows: K,{S) = 1 if « G S" and ^(5") = otherwise. Finally, if the samplings Sj are uniform, from 
(32) we know that P(i G Sj) = E[|S'j|]/n for all i and j. Plugging this into identity (iii) shows 
that P(i G S) is independent of i, which shows that S is uniform. Now assume that Sj are doubly 
uniform. Fixing arbitrary r G {0} U [n], for every S C [n] such that \S\ = r we have 



p(5 = 5)ef;,,.P(5, = 5) = f;,,^ 

1 = 1 1 = 1 ^t) 



3=1 i=i 

As the last expression depends on S via \S\ only, S is doubly uniform. □ 
Remarks: 

1. If we fix S" C [n] and define k{S') = 1 if 5" = 5 and k{S') = otherwise, then statement (i) 
of Theorem [s] reduces to (55). 

2. All samplings arise as a combination of elementary samplings, i.e., samplings whose all weight 
is on one set only. Indeed, let S be an arbitrary sampling. For all subsets Sj of [n] define Sj 
by Y*[Sj = Sj) = 1 and let qj = P{S = Sj). Then clearly, S = Ylj QjSj- 

3. All doubly uniform samplings arise as convex combinations of nice samplings. 

Often it is easier to establish ESO for a simple class of samplings (e.g., nice samplings) and then 
use it to obtain an ESO for a more complicated class (e.g., doubly uniform samplings as they arise 
as convex combinations of nice samplings). The following result is helpful in this regard. 

Theorem 9 (Convex Combination of Uniform Samplings). Let Si, ... , Sm be uniform samplings 
satisfying {f,Sj) ~ ES0{l3j,Wj) and let q G R™' be a probability vector. If^jQjSj is not nil, then 

m \ I m 

f,^q,S, ]^ES0i ,^g,E[|5,|]/3,u;, 
j=i I \2^j=iQjH\Sj\\ j^i 
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A def . 



Proof. First note that from part (iv) of Lemma |8j we know that S = Ylj Qj^j is uniform and hence 
it makes sense to speak about ESO involving this samphng. Next, we can write 



E 



fix + h 



^ P(5 = S)f{x + h[s]) ^ E E ^^^(^^ = ^)/(^ + ^^1^ 
s s j 



It now remains to use ( 42 ) and part (ii) of Lemma [8] 

(|42ll 



E'Z, /(^) + ^ (V/(x),/i) + 



(Lemma [s] (ii)) 



/(x) + M(^(V/(x),/i) + 
/(x) + Mf(V/(x),/i) + 



E,g,E[|5,|]/3,||fe|| 
2E,9.E[|5,|] 



2E,9jE[|5,|] 



where t/; = gjE[|5j |]/3ju;j. In the third step we have also used the fact that E[|S'|] > which 
follows from the assumption that S is not nil. □ 

4.3 ESO for a conic combination of functions 

We now establish an ESO for a conic combination of functions each of which is already equipped 
with an ESO. It offers a complementary result to Theorem [9] 

Theorem 10 (Conic Combination of Functions). // (/j, S) ~ ESO{Pj, wj) for j = 1, . . . ,m, then 
for any ci , . . . , > we have 



^c,f„s\r^ES0il,^c,^,w,]. 



Proof Letting f = J2j ^jfj, we get 



E 



Ecj/j [x + h 



IS] 



EcjE fj(x + h 



IS] 



E^^-/^-(^) + ^ I E^^-(w,(^), /i) + E 



/(x) + ^((V/(x),/i) + i||/i|||^ 



□ 
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5 Expected Separable Overapproximation (ESO) of Partially Sep- 
arable Functions 

Here we derive ESO inequalities for partially separable smooth functions / and (proper) uniform 



(Section 5.1), nonoverlapping uniform (Section |5.2[), nice (Section 5.3) and doubly uniform (Sec 



tion 5.4) samplings. 



5.1 Uniform samplings 

Consider an arbitrary proper sampling S and let u = (ui, . . . , i^n)^ be defined by 



def „ 
z/j = E 



mm{oo,\S\} \ ie S = ^ ^ P(5) min{u;, |5|}, ie 



n 



Lemma 11. If S is proper, then 



E [fix + /ij^j)] < fix) + (V/(x), h)p + i 



(56) 



Proof. Let us use Theorem jTj with h replaced by h^^y Note that maxjgj- 1 J n Supp(/ij^j)| < 
maxjgj- |Jn S"! < minju;, \S\}. Taking expectations of both sides of (47) we therefore get 



E 



fix + h 



fix) + B {Vfix),h,, 



[S]/ 



+ iE 



min{a;,|5|}||/i,- 
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fix) + (V/(x), h)p+lB \mm{u;, \S\}\\h^^^ \\l 



(57) 



It remains to bound the last term in the expression above. Letting 9i = Li\\h^^^\'^.-^, we have 



E 



min{a;,|5'|}||/ij_5]||i 



E 



5^min{a;,|5|}L,||/^»||2) 



^ PiS)Y,^m{u;,\Sm 

Scfnl je5 



n n n 

Y^OiY, mW^, |5|}P(5) = J^^iPiE [min{a;, |5|} | i G ^] = Y,<^^p^u, 



i=l S:ieS 



i=l 



pQvQL- 



□ 



i=l 



(58) 



The above lemma will now be used to establish ESO for arbitrary (proper) uniform samplings. 
Theorem 12. If S is proper and uniform, then 

if,S)^ ESOil,uQL). (59) 
//, in addition, Pi\S\ = r) = 1 (we say that S is T-uniform), then 

if, S) ~ ESOim.m{uj, r}, L). (60) 



Moreover, ESO (60) is monotonic. 
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Proof. First, (59) follows from (|56|) since for a uniform sampling one has pi = E[|S'|]/n for all i. 



If Pd^l = r) = 1, we get = min{a;,r} for all i; (60) therefore follows from (59). Let us now 
establish monotonicity. Using the deterministic separable overapproximation (47) with h = /ij^j, 



Fix + /ij^]) < fix) + (V/(x), h^g^) + max^—^\\h^g^ \\l + + /i^^j 



\JnS\ I 

< /(x) + (V/(x), /ij^j) + f \\l + f)(x + (61) 

= f{x) + Y.^{{Vfix),uM'^) + /M||/,W||2^ + + (62) 
ies^ V ■' i^s 

Now let hix) = aigmiiih Hp^mix, h) and recall that 
Hp,Ux, h) = fix) + {Vfix), h) + f + 0(x + /i) 

n n 

= fi^) + y. ((V/(a:), + + + /i«)) = fix) + V K,(/i«). 



^ ^ l-^^ •^J" / -r — ll"' ' ll(i) ^ "jV-^ ' ' -r -jy — J l-^; -r / ^ ' 
1=1 i=l 

So, by definition, ihix))^'^^ minimizes Ki(t) and hence, (/i(x))j_jj (recall ([?])) minimizes the upper 
bound ( |62[ ). In particular, (/i(j;))j^j is better than a nil update, which immediately gives Fix + 

ihix))^^^) < fix) + Eies '^^(0) + ^*(^^'^) = ^(^)- ° 

Besides establishing an ESO result, we have just shown that, in the case of r-uniform samplings 
with a conservative estimate for /3, PCDMl is monotonic, i.e., Fix^+i) < -F(xfc). In particular, 
PCDMl and PCDM2 coincide. We call the estimate f3 = minjo;, r} "conservative" because it can 



be improved (made smaller) in special cases; e.g., for the r-nice sampling. Indeed, Theorem 14 
establishes an ESO for the r-nice sampling with the same w iw = L), but with /3 = 1 + 



n-l 



which is better (and can be much better than) min{ti;,r}. Other things equal, smaller /3 directly 
translates into better complexity. The price for the small /3 in the case of the r-nice sampling 
is the loss of monotonicity. This is not a problem for strongly convex objective, but for merely 
convex objective this is an issue as the analysis techniques we developed are only applicable to the 



monotonic method PCDM2 (see Theorem 19). 



5.2 Nonover lapping uniform samplings 



Let be a (proper) nonoverlapping uniform sampling as defined in (22). If i G S*-^, for some 
j G {1,2,...,/}, define 

7j = maxlJnS^, (63) 

and let 7 = (71, . . . ,7„)'^. 

Note that, for example, if S is the serial uniform sampling, then I = n and = {j} for 
j = 1,2, ... ,Z, whence 7^ = 1 for all i G [n]. For the fully parallel sampling we have / = 1 and 

= {1, 2, . . . , n}, whence 'yi = uj for all i £ [n]. 

Theorem 13. If S a nonoverlapping uniform sampling, then 

if,S)^ESOil,^QL). (64) 

Moreover, this ESO is monotonic. 
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Proof. By Theorem j7j used with h replaced by h^gjj for j = 1,2, ... ,1, we get 



f{x + h[sj]) < fix) + {V f{x),h[sj]) + max^-^^^\\h[sj]\\l. 



(65) 



Since 5 = with probabihty j, 



E 



fix + \s]) 



\jns^\ I 



J63l 



fix) + } {Vfix),h) + 1 ^ ^ ^,L,||/i«||2) 
/(^) + T((V/(x),/i) + |||/i||^0i), 



which estabhshes (64). It now only remains to establish monotonicity. Adding f](x + h^gyj to (65) 

with replaced by S, we get F(x + /i^^j) < fix) + (V/(x), + f + ^ix + h^^^). From 

this point on the proof is identical to that in Theorem [l2l following equation (61). □ 



5.3 Nice samplings 

In this section we establish an ESO for nice samplings. 
Theorem 14. // S is the r-nice sampling and t ^ 0, then 

if,S)^ESO 



max(l, n — 1 

Proof. Let us fix x and define (f) and cf)"^ as in the proof of Theorem [Tj Since 

3? 



(66) 



E 



,^(^[^]) =E f(x + h,^,)-fix)-{Vfix),h,^,) 



IS] 



E 



fix + h^^^) -f{x)-^{Vfix),h), 



it now only remains to show that 



E 



0(/i[^]) 



< ( 1 _|_ 

— 2n \ ' max 



(67) 



Let us now adopt the convention that expectation conditional on an event which happens with 



def 

probability is equal to 0. Letting r/j = | J H 5"!, and using this convention, we can write 



E 



IS]- 



[S]' 



n 

Y,Y.^[t''(\s])\VJ = k\Pirij = k) 

k=0 J&J 
n 

Y,FiVJ = k)Y.B[4>'ih^S^)\rjj = k 



k=0 



(68) 



Note that the last identity follows if we assume, without loss of generality, that all sets J have the 
same cardinality w (this can be achieved by introducing "dummy" dependencies). Indeed, in such 
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a case P(??j = k) does not depend on J. Now, for any k > 1 for which P^rjj = k) > (for some J 
and hence for all), using convexity of (j)'^ can now estimate 



E 



E 



< E 



I kU,h^'^ \r]j = k 

ieJns / 



i^Jns 



mi 



^^</)'^(fcC/./i«). (69) 
ieJ 



If we now sum the inequalities (69) for all J G ^7, we get 



Jej 



n n 



i=i JeJ 

n 
i=l 



i=l 



i)\\2 k' 



Finally, (67) follows after plugging (70) into (68): 



E 



'^(^5])| <E^ivj = k)i;\\h\\i = Uh\\in\jns\'] ^ ^ (i + ) \\hh. 



(70) 



□ 



5.4 Doubly uniform samplings 

We are now ready, using a bootstrapping argument, to formulate and prove a result covering all 
doubly uniform samplings. 



Theorem 15. If S is a (proper) doubly uniform sampling, then 

( 

{f,S)r^ESO 



(,, ufmM A 



V 



~'~ max(l, n — 1) 



Proof Letting gf^ = PdS"! = k) and d = max{l,n — 1}, we have 



E 



n 

fix + /i[^])] = E [e [fix + /i[^j) I \S\\] =Y,QkB[fix + h^§^) I \S\ = k 



k=0 



n 

Yl'^ [m + I {{Vfix),h) + i (l + (-^m^^ 



k=0 



(71) 



/(^) + 'n E 9^^(V/(x), /,) + ^ ^ [A: (1 - + k'^] \\h\\l 



/(x) + M(V/(x),M + ^(E[|^|](l 



2 

L- 
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This theorem could have alternatively been proved by writing 5 as a convex combination of nice 
samplings and applying Theorem [91 □ 



Note that Theorem fTSl reduces to that of Theorem 14 in the special case of a nice sampling, and 



gives the same result as Theorem 13 in the case of the serial and fully parallel samplings. 



5.5 Summary of results 

In Table [2] we summarize all ESO inequalities established in this section. The first three results 
cover all the remaining ones as special cases. 



S 

uniform 



a 

n 



W 



Monotonic? 
No 



Follows from 
Thm 12 



nonoverlapping uniform 



70i 



Yes 



Thm 



13 



doubly uniform 



E[|5|] 



1 + 



1-1 



max(l,?i— 1) 



L 



No 



Thm 



15 



12 



r-uniform 



r-nice 



minjo;, r} 



1 + 



(a;-l)(r-l) 
max(l,n— 1) 



L 
L 



Yes 
No 



Thm 
Thm 



14 



15 



(r, pf,)-binomial 



rPb 



1 + 



Ph(a;-l)(r-l) 
max(l,ri— 1) 



L 



No 



Thm 



15 



serial 



1 



L 



Yes 



Thm 



13 



14 



15 



fully parallel 



1 



L 



Yes 



Thm 



13 



14 



15 



Table 2: Values of parameters j3 and w of an ESO for partially separable / (of degree oj) and various 

proper uniform samplings S: {f,S) ~ ESO{l3,w). We also include a = as it appears in the 

complexity estimates. 



6 Iteration Complexity 



In this section we prove two iteration complexity theorems. The first result (Theorem 19) is for 



non-strongly-convex F and covers PCDM2 with no restrictions and PCDMl only in the case when 



a monotonic ESO is used. The second result (Theorem 20) is for strongly convex F and covers 
PCDMl without any monotonicity restrictions. Let us first establish two auxiliary results. 

Lemma 16. For all x G domF, Hp^w{x,h{x)) < miHy^^N {F (y) + ^— — x\\'^}. 
Proof. 

Hi3^^{x,h{x))^ min Hf3^y,{x,y - x) = min f{x) + {V f{x),y - x) + n{y) + ^\\y - x\\l, 

nnn /(y) - ^\\y - x\\l + n{y) + f ||y - x\\l. □ 



y£R 



N 
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Lemma 17. (i) Let x* he an optimal solution of ([T]), x G domF and let R = \\x — x*\\u)- Then 

y\PR^ < -F*), otherwise, 

(a) If Hf{w) + Uniw) > and j3 > ^f{w), then for all x G domF, 

Hp^^ix, h{x)) -F*< ^—l^(F{x) - F*). (73) 
Proof Part (i): Since we do not assume strong convexity, we have Hfiw) = 0, and hence 

(Lemma llGI l 

Hj3^y,{x,h{x)) < min {F(y) + f ||y - 

yen" 

< min {F(Ax* + (l-A)a;) + ^||x-a;*||^} 

< min {F(x) - X(F(x) - F*) + ^R^}. 

Minimizing the last expression in A gives A* = min {l, {F{x) - F*)/{/3R^)}] the result follows. Part 
(ii): Letting fif = fif{w), fin = f^niw) and A* = {fij + + fJ-n) < 1, we have 



(Lemma I16ll 
< 


min iF(y) + — 


2 \\y -^llui} 


< 


min iF(Xx* + 

Ae[o,i] 


(1 A)x) + ('^--/)^^||x x*\\l} 


< 


min |AF* + (1 

Ae[o,i] 


A)F(X) (M/+A'n)A(l-A)-(/3-M/)A^||^ ^*||2^| 


< 


F{x) - X*{F{x) 


-F*). 



The last inequality follows from the identity {fif + /Un)(l — A*) — {(3 — /^/)A* =0. □ 

We could have formulated part (ii) of the above result using the weaker assumption fipiw) > 0, 
leading to a slightly stronger result. However, we prefer the above treatment as it gives more insight. 

6.1 Iteration complexity: convex case 

The following lemma will be used to finish off the proof of the complexity result of this section. 

Lemma 18 (Theorem 1 in |13j ) . Fix xq G and let {xk}k>o be a sequence of random vectors 
in with x^+i depending on Xk only. Let (p : R^ — t- R 6e a nonnegative function and define 
^fc = (j){xk)- Lastly, choose accuracy level < e < ^O; confidence level < p < 1, and assume that 
the sequence of random variables {ik}k>o is nonincreasing and has one of the following properties: 

(i) E[^fc+i I Xk] < (1 — ^)^k, for all k, where ci > e is a constant, 

(ii) E[^fc+i I Xk] < (1 — -)^k, for all k such that ^k ^ where C2 > 1 is a constant. 
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If property (i) holds and we choose K > 2 + ^(1 ~ ^ + log(p)); or if property (ii) holds, and we 
choose K >C2 log(^), then P{S,k < e) > 1 - /O- 

This lemma was recently extended in [lOj so as to aid the analysis of a serial coordinate descent 
method with inexact updates, i.e., with h{x) chosen as an approximate rather than exact minimizer 
of Hi^l(x,-) (see (17)). While in this paper we deal with exact updates only, the results can be 
extended to the inexact case. 

Theorem 19. Assume that {f,S) ~ ES0{(3,'w), where S is a proper uniform sampling, and let 
a = Choose xq £ satisfying 

TZw{xq,x*) max{\\x — x*\\w ■ F{x) < F{xq)} < +oo, (74) 

X 

where x* is an optimal point of ([T]). Further, choose target confidence level < p < 1, target 
accuracy level e > and iteration counter K in any of the following two ways: 

(i) e< F{xo) - F* and 

2(|)max{7^^(xo,x*),^^^i^} / , /^xx 



(ii) e < min{2 (^^j7^4(a;o,2;*),F(xo) -F*} and 

A-a'(°)^f°-"'.o,(^'-°);^-). (7.) 

If {xk}, k >0, are the random iterates of PCDM (use PCDMl if the ESO is monotonic, otherwise 
use PCDM2), then P{F{xk) - F* < e) > 1 - p. 

Proof Since either PCDM2 is used (which is monotonic) or otherwise the ESO is monotonic, we 
must have F{xk) < F{xq) for all k. In particular, in view of (74) this implies that \\xk — x*\\w < 
TZuj{xq, X*). Letting = F{xj?) — F* , we have 

E[Cfc+i I Xk] S (1 - a)^k + a{H/3^y,{xk, h{xk)) - F*) 
1I72I1 



1 - a)ik + a max <j 1 - — — — — ^ \ e,k 



= maxn-— ^Ti2"'l-oKfc 

t 2l3\\xk-x*\\i 2 J 

Consider case (i) and let ci = 2^ max{7^^(xo, x*), ^}. Continuing with (77), we then get 

nik+i I < (1 - 1)6 

for all A; > 0. Since e < .^0 < ci, it suffices to apply Lemma [l8[i). Consider now case (ii) and let 



C2 



2^ "^™^^°'^ ^ ■ Observe now that whenever > e, from (|77[) we get E[(^fc_(_i | Xk] < (1 — 7r)ik- 



C2 - 



By assumption, C2 > 1, and hence it remains to apply Lemma PLSfii) . □ 
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The important message of the above theorem is that the iteration complexity of our methods in 
the convex case is 0{^\). Note that for the serial method (PCDMl used with S being the serial 
sampling) we have a = ^ and /3 = 1 (see Table [ij), and hence ^ = n. It will be interesting to study 
the parallelization speedup factor defined by 



parallelization speedup factor 



- of the serial method 



n 



- of a parallel method - of a parallel method 



(78) 



Table [3j computed from the data in Table [2] gives expressions for the parallelization speedup 
factors for PCDM based on a DU sampling (expressions for 4 special cases are given as well). 



5 



doubly uniform 



Parallelization speedup factor 



1+ 



E[|5|] 

(a;-l)((E[|Sp]/E[|S|])-l) 
max(l,n— 1) 



(r, pfe)-binomial 

r-nice 

fully parallel 
serial 



1 (a;-l)(r-l) 
Pb max(l,n— 1) 



^ (c.-l)(r-l) 
max(l,rt— 1) 



n 
w 



Table 3: Convex F: Parallelization speedup factors for DU samplings. The factors below the line 
are special cases of the general expression. Maximum speedup is naturally obtained by the fully 
parallel sampling: — . 



The speedup of the serial sampling (i.e., of the algorithm based on it) is 1 as we are comparing 
it to itself. On the other end of the spectrum is the fully parallel sampling with a speedup of ^. 
If the degree of partial separability is small, then this factor will be high — especially so if n is 
huge, which is the domain we are interested in. This provides an affirmative answer to the research 
question stated in italics in the introduction. 



Let us now look at the speedup factor in the case of a r-nice sampling. Letting r 
[0, 1] (degree of partial separability normalized), the speedup factor can be written as 



^-1 

max(l,n— 1) 



s{r) 



l + r(r- 1) 



Note that as long as r < ^^^^ ~ ^ , the speedup factor will be at least ^ . Also note that max{ 1 , ^ } < 
s{r) < min{r, ^}. Finally, if a speedup of at least s is desired, where s G [0, one needs to use at 



least 



l/s- 



processors. For illustration, in Figure 



for small values of r, the speedup is significant aiid can be as large as the number of processors (in 



we plotted s(r) for a few values of r. Note that 
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the separable case). We wish to stress that in many apphcations uj will be a constant independent 
of n, which means that r will indeed be very small in the huge-scale optimization setting. 




r = ((o-1)/(n-1) 

Figure 1: Parallelization speedup factor of PCDMl /PCDM2 used with r-nice sampling as a function 
of the normalized/relative degree of partial separability r. 



6.2 Iteration complexity: strongly convex case 

In this section we assume that F is strongly convex with respect to the norm || • ||^ and show that 
F{xk) converges to F* linearly, with high probability. 

Theorem 20. Assume F is strongly convex with fJ-f{w) + ^J-n{w) > 0. Further, assume {f,S) ~ 

ES0{f3, w), where S is a proper uniform sampling and let a = Choose initial point xq G R-'^, 

target confidence level < p < 1, target accuracy level < e < F{xo) — F* and 

K>L f + ,og(^<-°)-^-V (79) 

a ij.f{w) + pn{w) \ ep J 

If {xk} are the random points generated by PCDMl or PCDM2, then P{F{xk) — F* < e) > 1 — p. 
Proof. Letting = F{xk) — F* , we have 

m+i I Xk] ? (1 - a)a + a{Hp,Uxk, Hxk)) - F*) 9 (l - def _ 



Note that < 7 < 1 since < a < 1 and /? > Pf{w) by (46). By taking expectation in x^, we 
obtain E[^fc] < (1 - -if^Q. Finally, it remains to use Markov inequality: 

p(eA- > e) < < ^ ' T p. □ 
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Instead of doing a direct calculation, we could have finished the proof of Theorem 20 by applying 
Lemma |18[ ii) to the inequality E[^fc_|_i | Xk] < (1 — 7)?fc- However, in order to be able to use 
Lemma [Tsj we would have to first establish monotonicity of the sequence {£,k}, k > 0. This is not 
necessary using the direct approach of Theorem 20 Hence, in the strongly convex case we can 



analyze PCDMl and are not forced to resort to PCDM2. Consider now the following situations: 

1. Hf{w) = 0. Then the leading term in (79) is li^^/Mili!)^ 

2. fin{w) = 0. Then the leading term in (|79|) is 

3. fJ.n{w) is "large enough". Then 



a 



/3+^n(ui) 



1 and the leading term in ( 79 ) is 



In a similar way as in the non-strongly convex case, define the parallelization speedup factor as the 
ratio of the leading term in (79) for the serial method (which has a = ^ and /3 = 1) and the leading 
term for a parallel method: 



parallelization speedup factor 



n 



n 



l_l3+jin(w)_ 



of a parallel method 



/9+Atn(w) 



(80) 



First, note that the speedup factor is independent of Further, note that as fin{w) — >• 0, 



the speedup factor approaches the factor we obtained in the non-strongly convex case (see (78) 
and also Table [3|. That is, for large values of fJ,n{w), the speedup factor is approximately equal 
an = 'E,[\S\], which is the average number of blocks updated in a single parallel iteration. Note that 
thuis quantity does not depend on the degree of partial separability of /. 



7 Numerical Experiments 



In Section 7.1 we present preliminary but very encouraging results showing that PCDMl run on 
a system with 24 cores can solve huge-scale partially-separable LASSO problems with a billion 
variables in 2 hours, compared with 41 hours on a single core. In Section [7. 2 1 we demonstrate that 
our analysis is in some sense tight. In particular, we show that the speedup predicted by the theory 
can be matched almost exactly by actual wall time speedup for a particular problem. 

7.1 A LASSO problem with 1 billion variables 

In this experiment we solve a single randomly generated huge-scale LASSO instance, i.e., ([T]) with 

f{x) = '^\\Ax-b\\l n{x) = \\xh, 

where A = [ai, . . . , a„] has 2 x 10^ rows and N = n = 10^ columns. We generated the problem 
using a modified primal-dual generator |13| enabling us to choose the optimal solution x* (and 
hence, indirectly, F*) and thus to control its cardinality ||x*||o, as well as the sparsity level of A. In 
particular, we made the following choices: ||x*||o = 10^, each column of A has exactly 20 nonzeros 
and the maximum cardinality of a row of ^ is a; = 35 (the degree of partial separability of /). The 
histogram of cardinalities is displayed in Figure |2] 

We solved the problem using PCDMl with r-nice sampling S", /3 = 1 + "^^ and w = L = 

(||ai II2, • • • , llflnlli)) for ''" = 1)2, 4, 8, 16, 24, on a single large-memory computer utilizing r of its 24 
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2.5 
2 




cardinality 

Figure 2: Histogram of the cardinalities of the rows of A. 



cores. The problem description took around 350GB of memory space. In fact, in our implementation 
we departed from the just described setup in two ways. First, we implemented an asynchronous 
version of the method; i.e., one in which cores do not wait for others to update the current iterate 
within an iteration before reading a^^+i and proceeding to another update step. Instead, each core 
reads the current iterate whenever it is ready with the previous update step and applies the new 
update as soon as it is computed. Second, as mentioned in Section [3j the r-independent sampling is 
for r ^ n a very good approximation of the r-nice sampling. We therefore allowed each processor 
to pick a block uniformly at random, independently from the other processors. 

Choice of the first column of Table |4} In Table |4] we show the development of the gap 
F{xk) — F* as well as the elapsed time. The choice and meaning of the first column of the table, 
needs some commentary. Note that exactly rk coordinate updates are performed after k iterations. 
Hence, the first column denotes the total number of coordinate updates normalized by the number 
of coordinates n. As an example, let ri = 1 and T2 = 24. Then if the serial method is run for 
ki = 24 iterations and the parallel one for A;2 = 1 iteration, both methods would have updated the 
same number (rifci = r2/c2 = 24) of coordinates; that is, they would "be" in the same row of Table |4] 
In summary, each row of the table represents, in the sense described above, the "same amount of 
work done" for each choice of r. 

Progress to solving the problem. One can conjecture that the above meaning of the phrase 
"same amount of work done" would perhaps be roughly equivalent to a different one: "same progress 
to solving the problem". Indeed, it turns out, as can be seen from the table and also from Fig- 
ure |3][|a), that in each row for all algorithms the value of F{xk) — F* is roughly of the same order of 
magnitude. This is not a trivial finding since, with increasing r, older information is used to update 
the coordinates, and hence one would expect that convergence would be slower. It does seem to be 
slower — the gap F{xk) — F* is generally higher if more processors are used — but the slowdown is 
limited. Looking at Table |4] and/or Figure |3]^a), we see that for all choices of r, PCDMl managed 
to push the gap below 10~ after 34n to 37n coordinate updates. 

The progress to solving the problem during the final 1 billion coordinate updates (i.e., when 
moving from the last-but-one to the last nonempty line in each of the columns of Table |4] showing 
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Table 4: A LASSO problem with 10*^ variables solved by PCDMl with r = 1, 2, 4, 8, 16 and 24. 



F{xk) — F* ) \s remarkable. The method managed to push the optimality gap by 9-12 degrees of 
magnitude. We do not have an explanation for this phenomenon; we do not give local convergence 
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(a) For each r, PCDMl needs roughly the same number (b) DoubUng the number of cores corresponds to 



of coordinate updates to solve the problem. 



roughly halving the number of iterations. 
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(c) Doubling the number of cores corresponds to roughly (d) Parallelization speedup is essentially equal to the 
halving the wall time. number of cores. 



Figure 3: Four computational insights into the workings of PCDMl. 



estimates in this paper. It is certainly the case though that once the method managed to find the 
nonzero places of x*, fast local convergence comes in. 

Parallelization speedup. Since a parallel method utilizing r cores manages to do the same 
number of coordinate updates as the serial one r times faster, a direct consequence of the above 
observation is that doubling the number of cores corresponds to roughly halving the number of 
iterations (see Figure [sf^b) . This is due to the fact that uj n and t <^ n. It turns out that 
the number of iterations is an excellent predictor of wall time; this can be seen by comparing 
Figures [sj^b) and[3|^c). Finally, it follows from the above, and can be seen in Figure [3]^d) , that the 
speedup of PCDMl utilizing r cores is roughly equal to r. Note that this is caused by the fact that 
the problem is, relative to its dimension, partially separable to a very high degree. 
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7.2 Theory versus reality 



In our second experiment we demonstrate numerically that our parallelization speedup estimates 
are in some sense tight. For this purpose it is not necessary to reach for complicated problems and 
high dimensions; we hence minimize the function ^||Ax — &II2 with A S R,3000xiooo^ Matrix A was 
generated so that its every row contains exactly u non-zero values all of which are equal (recall the 



construction in point 3 at the end of Section 4.1). 



❖ 03=5 

« a)=10 
□ £0=50 

O (B=100 




— o o ocxxxn ) 



10 10' 
# of cores (t) 



10" 



Figure 4: Theoretical speedup factor predicts the actual speedup almost exactly for a carefully 
constructed problem. 

We generated 4 matrices with uj = 5, 10, 50 and 100 and measured the number of iterations 
needed for PCDMl used with r-nice sampling to get within e = 10~^ of the optimal value. The 
experiment was done for a range of values of r (between 1 core and 1000 cores). 

The solid lines in Figure [4] present the theoretical speedup factor for the r-nice sampling, as 
presented in Table [3j The markers in each case correspond to empirical speedup factor defined as 

^ of iterations till e-solution is found by PCDMl used with serial sampling 
# of iterations till e-solution is found by PCDMl used with r-nice sampling 
As can be seen in Figure|4j the match between theoretical prediction and reality is remarkable! A 
partial explanation of this phenomenon lies in the fact that we have carefully designed the problem 
so as to ensure that the degree of partial separability is equal to the Lipschitz constant a of V/ (i.e.. 



that it is not a gross overestimation of it; see Section 4.1). This fact is useful since it is possible to 
prove complexity results with u replaced by a. However, this answer is far from satisfying, and a 
deeper understanding of the phenomenon remains an open problem. 
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